Modeling of term-distance and term-occurrence information for improving n-gram language model performance

Authors

  • Tze Yuang Chong
  • Rafael E. Banchs
  • Chng Eng Siong
  • Haizhou Li
Abstract

In this paper, we explore the use of distance and co-occurrence information of word-pairs for language modeling. We extract this information from history-contexts of up to ten words and find that it complements the n-gram model well, since the n-gram model inherently suffers from data scarcity when learning long history-contexts. Evaluated on the WSJ corpus, bigram and trigram model perplexities were reduced by up to 23.5% and 14.0%, respectively. Compared to the distant bigram, we show that word-pairs can be modeled more effectively in terms of both distance and occurrence.
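As an illustration of the kind of statistics involved, the following minimal sketch (not the paper's implementation; function and variable names are hypothetical) collects term-occurrence and term-distance counts for word pairs within a history window of up to ten words:

```python
# Minimal sketch, assuming a simple pair-counting scheme over a 10-word history.
from collections import defaultdict

MAX_HISTORY = 10  # history-context size taken from the abstract

def collect_pair_stats(sentences):
    """Count, for each (history_word, target_word) pair, how often the pair
    co-occurs and at which distances (1 = adjacent) it is observed."""
    occurrence = defaultdict(int)                      # (w_hist, w_tgt) -> count
    distance = defaultdict(lambda: defaultdict(int))   # (w_hist, w_tgt) -> {d: count}
    for tokens in sentences:
        for i, target in enumerate(tokens):
            history = tokens[max(0, i - MAX_HISTORY):i]
            for d, hist_word in enumerate(reversed(history), start=1):
                occurrence[(hist_word, target)] += 1
                distance[(hist_word, target)][d] += 1
    return occurrence, distance

# Toy usage:
occ, dist = collect_pair_stats([["the", "stock", "market", "fell", "sharply"]])
print(occ[("stock", "fell")])         # 1
print(dict(dist[("stock", "fell")]))  # {2: 1}
```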


Related articles

Improving n-gram modeling using distance-related unit association maximum entropy language modeling

In this paper, a distance-related unit association maximum entropy (DUAME) language model is proposed. This approach models an event (a unit subsequence) using the co-occurrence of full-distance unit association (UA) features, so that it is able to pursue a functional approximation to a higher-order N-gram model with significantly less memory. A smoothing strategy related to this modeling...

Full text
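The following is a hedged sketch, under our own assumptions, of what distance-related unit association (UA) features for a maximum entropy model could look like; the feature template is illustrative only, not the DUAME paper's exact definition:

```python
# Hypothetical UA feature template: pair the target word with each history
# word together with their separation distance.
def ua_features(history, target, max_distance=4):
    """Return binary feature names for (history word, distance, target)."""
    feats = []
    for d, hist_word in enumerate(reversed(history[-max_distance:]), start=1):
        feats.append(f"UA[{hist_word}+{d}->{target}]")
    return feats

# For the history "how are you" and the target "today":
print(ua_features(["how", "are", "you"], "today"))
# ['UA[you+1->today]', 'UA[are+2->today]', 'UA[how+3->today]']
```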

N-Gram Models

Key Points: In automatic speech recognition, n-grams are important for modeling some of the structural usage of natural language, i.e., the model uses word dependencies to assign a higher probability to "how are you today" than to "are how today you," although both phrases contain exactly the same words. If used in information retrieval, simple unigram language models (n-gram models with n = 1),...

Full text
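A minimal toy example, using maximum-likelihood bigram estimates over a tiny made-up corpus, shows why the well-ordered phrase receives a higher probability than its scrambled counterpart:

```python
# Toy bigram model with maximum-likelihood estimates (no smoothing).
from collections import Counter

corpus = "how are you today . how are you doing . you are welcome .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(sentence):
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigrams[(w1, w2)] / unigrams[w1]  # unseen bigrams get probability 0
    return p

print(bigram_prob("how are you today"))  # ~0.22: every bigram was observed
print(bigram_prob("are how today you"))  # 0.0: word order never seen
```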

TDTO language modeling with feedforward neural networks

In this paper, we describe the use of feedforward neural networks to improve the term-distance term-occurrence (TDTO) language model previously proposed in [1]−[3]. The main idea behind the TDTO model is to separately model the position and occurrence information of words in the history-context in order to better estimate n-gram probabilities. Neural networks have been shown to offer a bet...

Full text
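As a rough sketch only, and not the paper's architecture, the snippet below shows how a small feedforward layer might combine separate term-distance (TD) and term-occurrence (TO) feature vectors into a next-word distribution; the sizes, parameters, and feature encoding are all assumptions:

```python
# Assumed toy architecture: two feature views projected into one hidden layer,
# then a softmax over the vocabulary.
import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 64                              # assumed vocabulary and hidden sizes

W_td = rng.normal(scale=0.1, size=(V, H))    # projection of TD features
W_to = rng.normal(scale=0.1, size=(V, H))    # projection of TO features
W_out = rng.normal(scale=0.1, size=(H, V))   # hidden layer -> vocabulary scores

def tdto_scores(td_vec, to_vec):
    """Combine the two feature views, apply a tanh hidden layer, and return
    a softmax distribution over the next word."""
    hidden = np.tanh(td_vec @ W_td + to_vec @ W_to)
    logits = hidden @ W_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = tdto_scores(rng.random(V), rng.random(V))
print(probs.shape, probs.sum())  # (1000,) 1.0
```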

A study of term weighting in phonotactic approach to spoken language recognition

In the spoken language recognition approach that models the phonetic lattice with a Support Vector Machine (SVM), term weighting on the supervector of N-gram probabilities is critical to recognition performance, because the weighting prevents the SVM kernel from being dominated by a few large probabilities. We investigate several term weighting functions used in text retrieval, which ...

Full text
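One text-retrieval-style weighting that fits this description is a TF-IDF-like rescaling; the sketch below is an assumption for illustration and may differ from the weighting functions actually studied in the paper:

```python
# Hypothetical sqrt-TF * IDF rescaling of an N-gram probability supervector,
# so a few large probabilities do not dominate the SVM kernel.
import math

def weight_supervector(ngram_probs, doc_freq, num_docs):
    """Rescale each n-gram probability by sqrt of its value times an IDF term."""
    weighted = {}
    for ngram, p in ngram_probs.items():
        idf = math.log((num_docs + 1) / (doc_freq.get(ngram, 0) + 1))
        weighted[ngram] = math.sqrt(p) * idf
    return weighted

sv = {"k ae t": 0.40, "ae t ax": 0.02}   # phone-trigram probabilities (toy values)
df = {"k ae t": 900, "ae t ax": 12}      # how many utterances contain each trigram
print(weight_supervector(sv, df, num_docs=1000))
# the frequent trigram is downweighted, the rare one is boosted
```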

Improved topic-dependent language modeling using information retrieval techniques

N-gram language models are frequently used by speech recognition systems to constrain and guide the search. N-gram models use only the last N-1 words to predict the next word, with typical values of N ranging from 2 to 4. N-gram language models thus lack long-term context information. We show that the predictive power of N-gram language models can be improved by using long-term ...

Full text


Publication year: 2013